In this module, we work through an example of a GWAS analysis using a marginal approach. Our approach is ‘marginal’ in the sense that we are testing the relationship between the outcome and the genetic data by going one SNP at a time. This is (by far) the most pervasive approach in the existing literature.
This module does not require that you have worked through the previous modules; however, I use language from the quality control, imputation, and population structure modules throughout my explanation of this marginal approach.
As always, we begin by loading the libraries with the tools we need for analysis.
library(data.table)
library(magrittr)
library(qqman)
library(snpStats)
library(dplyr)
Next, we load the data. Start by loading the clinical data, as this has the outcome (coronary artery disease (‘CAD’)) we need for our models.
clinical <- fread("data/GWAStutorial_clinical.csv")
# str(clinical) # if you need to remind yourself of what is here
Next, we need to load the genetic data. For this section, we will
work with the quality controlled (QC’d) data from the
SNPRelate package (see module 1). We will also need the
“.bim” file from the original data for making plots.
# load QC'd data:
qc_dat <- readRDS('data/gwas-qc.rds')
# load the bim file
bim <- fread('data/GWAStutorial.bim')
If you completed the population structure module, you should load the principal components as well - we need them for our analysis. If you did not work through the population structure module, you can skip this step.
# load principal components
PCs <- readRDS(file = "data/PCs_base.rds") %>%
as.data.frame()
names(PCs) <- paste0("PC", 1:ncol(PCs))
We begin our analysis with a logistic regression model which uses the
predictors sex and age to study the binary outcome CAD. With the
function snp.rhs.tests(), we will perform a large set of
logistic regression tests and save the p-values in a vector called
assoc_p_vals. In addition to adjusting for sex and age, we
will adjust for the first 4 principal components, as per the
observations we made about the population structure in our data.
assoc_test <- snpStats::snp.rhs.tests(
formula = clinical$CAD ~ clinical$sex + clinical$age +
PCs$PC1 + PCs$PC2 + PCs$PC3 + PCs$PC4,
family = "binomial",
link = "logit", # echoes the default settings
data = qc_dat$fam,
snp.data = qc_dat$genotypes
)
assoc_p_vals <- p.value(assoc_test)
Alternatively, one could implement this testing using the imputed
data. There is an optional rules argument in the
snp.rhs.tests() function that allows the user to include
the rules used for imputation. I will forgo doing this here – simply
want to point out that this approach is possible.
There are two methods for visualizing the results of a GWAS analysis: qq plots and Manhattan plots. Both of these tools are meant to highlight the genetic variants (SNPs) with the smallest corresponding p-values.
Due to their magnitude, p-values from GWAS studies are typically illustrated on the log-transformed scale.
As a first look at our results, we will examine a qq-plot. This plot
compares the observed p-values from our SNP tests with the p-values we
would expect if there were no associations. To make our qq-plot, we will
use the qq() function from the package
qqman.
qq_plot <- qqman::qq(assoc_p_vals)
(qq_plot)
# NULL
If the observed p-values follow exactly the distribution we would expect under the global null hypothesis, then all points will be plotted on the red line.
We notice in this qq-plot that some of the p-values (plotted as points) diverge from the red line to the left. This indicates that some observed values are below what is expected. From a statistical point of view, this is pretty promising. Having some p-values that are smaller than expected indicates that there are likely to be significant results in the data.
A Manhattan
plot makes the smallest p-values “pop” by plotting all p-values so
that the smallest ones are the highest. We can create a Manhattan plot
of our results using the manhattan() function (from the
aforementioned qqman package)
# format data to have readable labels.
manh_data <- data.frame(
SNP = assoc_test@snp.names, # NB: for S4 objects in R, use the "@" to access items
P = assoc_p_vals
) %>%
left_join(bim, by = c("SNP" = "V2")) %>%
rename(
CHR = V1,
BP = V4
) # recall that 'bim' files have a standardized format, so the column order is
# always the same
manh_plot <- manhattan(manh_data, ylim = c(0, 20))
(unlist(manh_plot))
# xpd
# FALSE
Here, the x-axis of the plot is divided into “bins”, where each bin represents a chromosome. The y axis shows the negative, log-transformed p-values. Each of the p-values from our analysis is plotted in the bin corresponding to the chromosome of the gene in that particular test, and the height of the point directly correlates with the significance of the test.
The goal of Manhattan plots are to help identify SNPs (or a region of SNPs) that are associated with a phenotype of interest. The blue and red lines represent values for \(-\text{log}_{10}\) transformed p-values at two specified thresholds of “significance.” When we do have SNPs with a p-value that exceeds these lines, we are often interested in the one that is the highest in a given region.
I could improve the Manhattan plot above by adding annotations which indicate the SNPs with the smallest p-values:
# NB: 5 × 10e−8 is a common threshold for significance in GWAS studies,
# whereas 5 x 10e-6 is a common threshold for "suggestive" results
# signif_threshold <- 5e-8 # this would be a more stringent alternative
suggest_threshold <- 5e-6
manh2 <- manhattan(manh_data, ylim = c(0, 10), annotatePval = suggest_threshold)
(unlist(manh2))
# xpd
# TRUE
Based on the Manhattan plots, one region of interest could be around
rs9632884 on Chromosome 9 - this SNP is above the suggestive threshold
(indicated by the blue line), and there are a lot of other notable
results clustered near this SNP. If I wanted to highlight a region of
interest, I could do this to highlight the specific genes in this
region, or locus. This
will also allow us to practice the final piece of functionality from
qqman:
bp_center <- manh_data %>% # NB: 'bp' stands for 'base pair'
filter(SNP == "rs9632884") %>%
pull(BP)
bp_range <- c(-1, 1) * 100000 + bp_center
snps_highlight <- manh_data %>%
filter(BP >= bp_range[1], BP <= bp_range[2], CHR == 9) %>%
pull(SNP)
manh3 <- manhattan(
manh_data,
ylim = c(0, 10),
annotatePval = .000005,
highlight = snps_highlight
)
(unlist(manh3))
# xpd
# TRUE
The above is a simple outline of tools for summarizing GWAS results
in R. There are other tools available for implementing GWAS
analyses - one popular software is called PLINK. The PLINK software is made up
of command line programs that implement a wide array of GWAS (and whole
genome) analyses. This
tutorial provides a place to begin for new PLINK users, and this other
tutorial is well-known and more comprehensive.